Vector representation based on a supervised codebook for Nepali documents classification
Abstract
Document representation with outlier tokens degrades classification performance because of the uncertain orientation of such tokens. Most existing document representation methods, in different languages including Nepali, ignore strategies to filter these tokens out of documents before learning their representations. In this article, we propose a novel document representation method based on a supervised codebook to represent Nepali documents, where our codebook contains only semantic tokens without outliers. Our codebook is domain-specific, as it is based on tokens in a given corpus that have higher similarities with the class labels in the corpus. Our method adopts a simple yet prominent representation method for each word, called probability-based word embedding. To show the efficacy of our method, we evaluate its performance in the document classification task using a Support Vector Machine and validate it against widely used document representation methods such as Bag of Words, Latent Dirichlet Allocation, Long Short-Term Memory, Word2Vec, Bidirectional Encoder Representations from Transformers and so on, using four Nepali text datasets (denoted A1, A2, A3 and A4). The experimental results show that our method produces state-of-the-art classification performance (77.46% accuracy on A1, 67.53% on A2, 80.54% on A3 and 89.58% on A4) compared to the widely used existing document representation methods. It yields the best classification accuracy on three datasets (A1, A2 and A3) and a comparable accuracy on the fourth dataset (A4). Furthermore, we introduce the largest Nepali document dataset (A4), called the NepaliLinguistic dataset, to the linguistic community.
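The abstract does not spell out how the codebook or the probability-based word embedding is computed, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that a word's embedding is its estimated class distribution P(class | word), and that the codebook keeps words whose strongest class probability reaches a hypothetical threshold `min_peak`, used here as a stand-in for "higher similarity with the class labels". All function names and the toy English tokens are illustrative.

```python
from collections import Counter, defaultdict


def word_class_probabilities(docs, labels):
    """Estimate P(class | word) from labelled, tokenised documents."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # word -> how often it occurs in each class
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][label] += 1
    probs = {}
    for word, per_class in counts.items():
        total = sum(per_class.values())
        probs[word] = [per_class[c] / total for c in classes]
    return classes, probs


def build_codebook(probs, min_peak=0.6):
    """Keep words whose strongest class probability reaches min_peak.

    Assumption: this filter is a placeholder for the paper's
    similarity-to-class-label criterion, which the abstract does not define.
    """
    return {w: v for w, v in probs.items() if max(v) >= min_peak}


def represent(tokens, codebook, n_classes):
    """Average the embeddings of the codebook words found in a document."""
    vecs = [codebook[t] for t in tokens if t in codebook]
    if not vecs:
        return [0.0] * n_classes
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# Toy corpus (hypothetical data, English tokens for readability).
train_docs = [
    ["match", "goal", "team", "the"],
    ["team", "score", "the"],
    ["vote", "party", "the"],
    ["party", "election", "the", "team"],
]
train_labels = ["sports", "sports", "politics", "politics"]

classes, probs = word_class_probabilities(train_docs, train_labels)
codebook = build_codebook(probs, min_peak=0.6)

# "the" is spread evenly over both classes, so it is dropped as an outlier.
doc_vector = represent(["the", "team", "goal"], codebook, len(classes))
```

In this sketch the resulting fixed-length vectors (one dimension per class) would then be passed to a classifier such as a Support Vector Machine, which is the classifier the abstract names.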
Similar resources
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Semi-Supervised Classification Based on Low Rank Representation
Graph-based semi-supervised classification uses a graph to capture the relationship between samples and exploits label propagation techniques on the graph to predict the labels of unlabeled samples. However, it is difficult to construct a graph that faithfully describes the relationship between high-dimensional samples. Recently, low-rank representation has been introduced to construct a graph,...
Text Classification Based On Manifold Semi-Supervised Support Vector Machine
This article presents a solution along with experimental results for an application of semi-supervised machine learning techniques and improvement on the SVM (Support Vector Machine) based on geodesic model to build text classification applications for Vietnamese language. The objective here is to improve the semi-supervised machine learning by replacing the kernel function of SVM using geodesi...
Efficient Vector Representation for Documents through Corruption
We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such captures the semantic meanings of the document during learning. A corruption model is included, which introduces a data-dependent regularization that favors infor...
An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification
Due to the exponential growth of electronic texts, their organization and management requires a tool to provide information and data in search of users in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...
Journal
Journal title: PeerJ
Year: 2021
ISSN: ['2167-8359']
DOI: https://doi.org/10.7717/peerj-cs.412